Soft Clustering Criterion Functions for Partitional Document Clustering

نویسندگان

  • Ying Zhao
  • George Karypis
چکیده

Recently published studies have shown that partitional clustering algorithms that optimize certain criterion functions, which measure key aspects of interand intra-cluster similarity, are very effective in producing hard clustering solutions for document datasets and outperform traditional partitional and agglomerative algorithms. In this paper we study the extent to which these criterion functions can be modified to include soft membership functions and whether or not the resulting soft clustering algorithms can further improve the clustering solutions. Specifically, we focus on four of these hard criterion functions, derive their soft-clustering extensions, present a comprehensive experimental evaluation involving twelve different datasets, and analyze their overall characteristics. Our results show that introducing softness into the criterion functions tends to lead to better clustering results for most datasets and consistently improve the separation between the clusters.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of Agglomerative and Partitional Document Clustering Algorithms

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters, and in greatly improving the retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of do...

متن کامل

Department of Computer Science and Engineering University of Minnesota 4 - 192 EECS Building 200 Union Street SE Minneapolis , MN 55455 - 0159 USA TR 04 - 021 gCLUTO – An Interactive Clustering , Visualization , and Analysis System

Recently published studies have shown that partitional clustering algorithms that optimize certain criterion functions, which measure key aspects of interand intra-cluster similarity, are very effective in producing hard clustering solutions for document datasets and outperform traditional partitional and agglomerative algorithms. In this paper we study the extent to which these criterion funct...

متن کامل

Partitional Clustering Experiments on Document Datasets

The purpose of this study is evaluation and comparison of some criterion functions used for document clustering. Each function is evaluated by using different clustering methods and different datasets. Detailed experiments show that some clustering criterion functions perform better than rest. Results of experiments are also consistent with previous works which compares same criterion functions.

متن کامل

Hierarchical Clustering in Medical Document Collections: the BIC-Means Method

Hierarchical clustering of text collections is a key problem in document management and retrieval. In partitional hierarchical clustering, which is more efficient than its agglomerative counterpart, the entire collection is split into clusters and the individual clusters are further split until a heuristically-motivated termination criterion is met. In this paper, we define the BIC-means algori...

متن کامل

A Comparison of Two Document Clustering Approaches for Clustering Medical Documents

form of medical reports. Such documents contain important information about patients, disease progression and management, but are difficult to analyse with conventional data mining techniques due to their unstructured nature. Clustering the medical documents into small number of meaningful clusters may facilitate discovering patterns by allowing us to extract a number of relevant features from ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004